9 research outputs found

    Biomedical Literature Mining and Knowledge Discovery of Phenotyping Definitions

    Indiana University-Purdue University Indianapolis (IUPUI)
    Phenotyping definitions are essential for cohort identification in clinical research, but they become an obstacle when they are not readily available. Developing new definitions manually requires expert involvement that is labor-intensive, time-consuming, and unscalable. Moreover, automated approaches rely mostly on electronic health record (EHR) data, which suffer from bias, confounding, and incompleteness. Few efforts have used text-mining and data-driven approaches to automate the extraction and literature-based knowledge discovery of phenotyping definitions and to support their scalability. In this dissertation, we propose a text-mining pipeline combining rule-based and machine-learning methods to automate the retrieval, classification, and extraction of phenotyping-definition information from the literature. To achieve this, we first developed an annotation guideline with ten dimensions for annotating sentences containing evidence of phenotyping-definition modalities, such as phenotypes and laboratory tests. Two annotators manually annotated a corpus of sentences (n=3,971) extracted from the methods sections of full-text observational studies (n=86). Percent agreement and Kappa statistics showed high inter-annotator agreement on sentence-level annotations. Second, we built and validated two text classifiers on our annotated corpora: one abstract-level and one full-text sentence-level. We applied the abstract-level classifier to a large-scale collection of over 20 million biomedical abstracts published between 1975 and 2018 to identify positive abstracts (n=459,406). After retrieving their full texts (n=120,868), we extracted sentences from the methods sections and used the sentence-level classifier to identify positive sentences (n=2,745,416). Third, we performed literature-based discovery on the positively classified sentences. 
    Lexica-based methods were used to recognize medical concepts in these sentences (n=19,423). Co-occurrence and association methods were then used to identify and rank phenotype candidates associated with a phenotype of interest, yielding 12,616,465 associations from our large-scale corpus. These literature-based associations and the large-scale corpus contribute to building new data-driven phenotyping definitions and to expanding existing definitions with minimal expert involvement.
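The co-occurrence and association-ranking step described above can be sketched as follows. This is an illustrative re-implementation, not the dissertation's code: the actual association measure is not specified in the abstract, so pointwise mutual information (PMI) is assumed here, and the tiny concept sets are invented examples.

```python
from collections import Counter
from itertools import combinations
import math

def rank_associations(sentences, target):
    """Rank concepts co-occurring with a target concept.

    `sentences` is a list of sets of recognized medical concepts
    (one set per sentence). PMI stands in for the association
    measure, which the source abstract does not specify.
    """
    concept_counts = Counter()
    pair_counts = Counter()
    n = len(sentences)
    for concepts in sentences:
        concept_counts.update(concepts)
        # Count each unordered concept pair once per sentence.
        pair_counts.update(frozenset(p) for p in combinations(sorted(concepts), 2))
    scores = {}
    for c in concept_counts:
        if c == target:
            continue
        joint = pair_counts[frozenset((target, c))]
        if joint == 0:
            continue  # never co-occurs with the target
        # PMI: log2 of observed joint probability over independence.
        pmi = math.log2((joint / n) /
                        ((concept_counts[target] / n) * (concept_counts[c] / n)))
        scores[c] = pmi
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical concept sets extracted from four sentences.
ranked = rank_associations(
    [{"diabetes", "hba1c"}, {"diabetes", "metformin"},
     {"diabetes", "hba1c"}, {"asthma"}],
    "diabetes",
)
```

Concepts that never co-occur with the target (here, "asthma") are dropped rather than ranked with a degenerate score.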

    Best Practices for Health Informatician Involvement in Interprofessional Health Care Teams

    Academic and nonacademic health informatics (HI) professionals (informaticians) serve on interprofessional health care teams alongside other professionals, such as physicians, nurses, pharmacists, dentists, and nutritionists. Here, we argue for investing greater attention in the role health informaticians play on interprofessional teams and in the best practices that support this role.

    PhenoDEF: a corpus for annotating sentences with information of phenotype definitions in biomedical literature

    Background: Adverse events induced by drug-drug interactions are a major concern in the United States. Current research is moving toward using electronic health record (EHR) data, including for adverse drug event discovery. One of the first steps in EHR-based studies is defining a phenotype to establish a cohort of patients. However, phenotype definitions are not readily available for all phenotypes, and one of the first steps in developing automated text-mining tools is building a corpus. This study therefore aimed to develop annotation guidelines and a gold-standard corpus to facilitate future automated approaches for mining phenotype definitions from the literature, and to improve our understanding of how published phenotype definitions are presented and how to annotate them for text-mining tasks. Results: Two annotators manually annotated the corpus at the sentence level for the presence of evidence of phenotype definitions. Three major categories (inclusion, intermediate, and exclusion) with a total of ten dimensions were proposed, characterizing the major contextual patterns and cues used to present phenotype definitions in the published literature. The developed annotation guidelines were used to annotate the corpus of 3971 sentences: 1923 (48.4%) for the inclusion category, 1851 (46.6%) for the intermediate category, and 2273 (57.2%) for the exclusion category. The highest number of annotated sentences, 1449 (36.5%), fell under the “Biomedical & Procedure” dimension; the lowest, 49 (1.2%), under “The use of NLP”. The overall percent inter-annotator agreement was 97.8%, and percent and Kappa statistics showed high inter-annotator agreement across all dimensions. 
    Conclusions: The corpus and annotation guidelines can serve as a foundational informatics resource for annotating and mining phenotype definitions in the literature, and can later support text-mining applications.
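The agreement statistics reported above can be computed as in the following sketch. This is a generic implementation of percent agreement and Cohen's kappa for two annotators' binary sentence labels, not the study's own code, and the label vectors are invented examples.

```python
from collections import Counter

def percent_and_kappa(labels_a, labels_b):
    """Percent agreement and Cohen's kappa for two annotators'
    labels over the same sentences (illustrative implementation).
    """
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of sentences labeled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    kappa = (observed - expected) / (1 - expected) if expected != 1 else 1.0
    return observed, kappa

# Hypothetical sentence-level labels (1 = evidence present, 0 = absent).
agree, kappa = percent_and_kappa([1, 1, 0, 1, 0, 0], [1, 1, 0, 1, 0, 1])
```

Kappa discounts the agreement expected by chance, which is why it is reported alongside raw percent agreement.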

    Public Perceptions around mHealth Applications during COVID-19 Pandemic: A Network and Sentiment Analysis of Tweets in Saudi Arabia

    A series of mitigation efforts were implemented in response to the COVID-19 pandemic in Saudi Arabia, including the development of mobile health applications (mHealth apps) for the public. Assessing the acceptability of mHealth apps among the public is crucial. This study aimed to use Twitter to understand public perceptions around the use of six Saudi mHealth apps used during COVID-19: “Sehha”, “Mawid”, “Sehhaty”, “Tetamman”, “Tawakkalna”, and “Tabaud”. We used two methodological approaches: network analysis and sentiment analysis. We retrieved Twitter data using keywords specific to each mHealth app. After including relevant tweets, the final mHealth app networks comprised 4995 Twitter users and 8666 conversational relationships. The largest networks by size (the number of users) and volume (the number of conversational relationships) were “Tawakkalna” followed by “Tabaud”, and their conversations were led by diverse governmental accounts. In contrast, the four remaining mHealth networks were led mainly by the health sector and the media. Our sentiment analysis used five classes and showed that most conversations were neutral, consisting of facts, pieces of information, and general inquiries. For the automated sentiment classifier, we used a Support Vector Machine with AraVec embeddings, as it outperformed the other classifiers tested, achieving an accuracy, precision, recall, and F1-score of 85%. Future studies can use social media and real-time analytics to improve mHealth apps’ services and user experience, especially during health crises.
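The classifier evaluation reported above (accuracy, precision, recall, F1) can be sketched as follows. This is a generic re-implementation of the metrics, not the study's code; macro averaging is assumed, and the three-class label vectors are invented examples (the study used five sentiment classes).

```python
def macro_metrics(y_true, y_pred, classes):
    """Accuracy plus macro-averaged precision, recall, and F1
    (illustrative; averaging scheme assumed, not from the study).
    """
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precisions, recalls, f1s = [], [], []
    for c in classes:
        # Per-class true positives, false positives, false negatives.
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    k = len(classes)
    return accuracy, sum(precisions) / k, sum(recalls) / k, sum(f1s) / k

# Hypothetical gold labels vs. classifier predictions.
acc, prec, rec, f1 = macro_metrics(
    ["pos", "neu", "neg", "neu"],
    ["pos", "neu", "neg", "pos"],
    ["pos", "neu", "neg"],
)
```

Macro averaging weights each sentiment class equally, which matters when class sizes are imbalanced, as is typical for tweet sentiment data.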

    Analyzing Patterns of Literature-Based Phenotyping Definitions for Text Mining Applications

    Phenotyping definitions are widely used in observational studies that draw on population data from Electronic Health Records (EHRs), and biomedical text mining supports biomedical knowledge discovery. We therefore believe that mining phenotyping definitions from the literature can support EHR-based clinical research. However, the way these definitions are presented in the literature is inconsistent, diverse, and poorly characterized, especially for text-mining purposes. We aim to analyze the patterns of phenotyping definitions as a first step toward developing a text-mining application that improves phenotype definition. A random set of observational studies was used for this analysis. Term frequency-inverse document frequency (TF-IDF) and term frequency (TF) were used to rank the terms in the 3958 sentences. Finally, we present preliminary results from analyzing phenotyping-definition patterns.
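The TF-IDF term-ranking step mentioned above can be sketched as in the following snippet. This is an illustrative stdlib implementation, not the paper's code: the smoothing variant is assumed, and the tokenized sentences are invented examples.

```python
import math
from collections import Counter

def tfidf_rank(documents, top_k=5):
    """Rank terms by summed TF-IDF across sentence 'documents'
    (each document is a list of tokens). Smoothed IDF is assumed.
    """
    n = len(documents)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    scores = Counter()
    for doc in documents:
        tf = Counter(doc)
        for term, count in tf.items():
            # Relative term frequency times smoothed inverse document frequency.
            scores[term] += (count / len(doc)) * math.log((1 + n) / (1 + df[term]))
    return scores.most_common(top_k)

# Hypothetical tokenized sentences from methods sections.
docs = [
    ["cohort", "was", "defined"],
    ["icd", "codes", "defined"],
    ["icd", "laboratory", "values"],
]
ranked = tfidf_rank(docs, top_k=7)
```

Terms concentrated in few sentences (e.g., "cohort") outrank terms spread across many (e.g., "defined"), which is the property that makes TF-IDF useful for surfacing definition-specific vocabulary.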

    U.S. Hospitals' Web-Based Patient Engagement Activities

    Digitized for IUPUI ScholarWorks inclusion in 2021. The purpose of this poster is to describe how U.S. hospitals use their websites to meet the National e-Health Collaborative (NeHC) patient engagement criteria and to explore trends, challenges, and opportunities for hospitals in leveraging websites for patient engagement.

    Bibliometric analysis of coronavirus disease 2019 medical research production from Saudi Arabia

    In response to the coronavirus disease 2019 (COVID-19) pandemic, and in an attempt to fill the knowledge gap related to this topic, extensive research in this area has been published. The Kingdom of Saudi Arabia, one of the countries affected by the pandemic, has contributed to the growing body of scientific literature on COVID-19. To measure the research contribution of Saudi Arabian-affiliated researchers, we conducted a bibliometric analysis of COVID-19 research across two databases: Web of Science and Scopus. The analysis was conducted on February 2, 2021 and included all relevant COVID-19-related publications (n = 1510). The majority were research articles, with an average of 4.8 citations per publication, an authorship collaboration dimension of 5.49 authors per publication, a collaboration index of 6.0, and 5.6 mean total citations per year. Further analysis showed that 89.5% of publications were multi-authored, reflecting the substantial efforts made toward research collaboration.

    Real-World Evidence of COVID-19 Patients’ Data Quality in the Electronic Health Records

    Despite the importance of electronic health record (EHR) data, relatively little attention has been given to their quality. This study aimed to evaluate the quality of COVID-19 patients’ records and their readiness for secondary use. We conducted a retrospective chart review of all COVID-19 inpatients in an academic hospital for the year 2020, who were identified using ICD-10 codes and case definition guidelines. COVID-19 signs and symptoms were recorded more often in unstructured clinical notes than in structured coded data. COVID-19 cases were categorized as 218 (66.46%) “confirmed cases”, 10 (3.05%) “probable cases”, 9 (2.74%) “suspected cases”, and 91 (27.74%) “no sufficient evidence”. Identifying “probable cases” and “suspected cases” was more challenging than identifying “confirmed cases”, for which laboratory confirmation was sufficient. The accuracy of COVID-19 case identification was higher with laboratory tests than with ICD-10 codes: when validating against laboratory results, we found that ICD-10 codes were inaccurately assigned in 238 (72.56%) patients’ records. “No sufficient evidence” records might indicate inaccurate and incomplete EHR data. Data quality evaluation should be incorporated to ensure patient safety and data readiness for secondary-use research and predictive analytics. We encourage educational and training efforts to motivate healthcare providers regarding the importance of accurate documentation at the point of care.